Cognizant BFS Innovation – RapidPrototypes

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Harshawardhan Nene, Cognizant, harshawardhan.nene@gmail.com  [PRIMARY contact]

Ramanand Janardhanan, Cognizant, ramanand.janardhanan@cognizant.com

Kedar Vaidya, Cognizant, kedar.ranjan-vaidya@cognizant.com

Raju Bhosale, Cognizant, raju.bhosale@cognizant.com

Vaishali Narkhede, Cognizant, vaishali.narkhede@cognizant.com

Tool(s):

Home-grown tools for tokenization, slang normalization, and stop word removal

JWNL and WordNet: Lemmatisation

Hepple POS Tagger: Part-of-speech tagging

MS Excel, NodeXL, Unix shell tools and OpenShot video editor

Protovis, Processing, controlP5 (Processing library)

MySQL

Video:

 

Video

 

 

ANSWERS:


MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

The number of microblog posts mentioning disease symptoms surges suddenly towards the end of Tue, 17th May 2011 and the surge continues until Fri, 20th May. We calculate the centroid of the posts containing each individual symptom and use the distance matrix between these centroids to generate a graph of symptoms. Varying the distance threshold at which two symptom nodes are connected allows us to divide the symptoms into two distinct groups (Figure 1). Assuming that the group containing ‘fever’ and ‘chills’ belongs to the epidemic, we use our map visualization tool to plot the posts containing any of the symptoms in that group on the Vastopolis map. In Figure 2, we can clearly see three large concentrations in Downtown at the following locations: Vastopolis Dome, Vastopolis City Hospital and the Convention Center.


Figure 1: Symptom graph

Figure 2: Epidemic outbreak
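
The symptom-grouping step described above can be illustrated with a minimal Python sketch. It is not the tool used for this entry: the coordinates are made up, Euclidean distance on map coordinates is an assumption, and the names `centroid` and `symptom_groups` are hypothetical. Symptoms whose centroids lie within the threshold are linked, and the connected components of the resulting graph form the groups.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Mean (x, y) position of a list of post coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def symptom_groups(posts_by_symptom, threshold):
    """Link symptoms whose centroids are within `threshold`; return the groups."""
    centroids = {s: centroid(p) for s, p in posts_by_symptom.items()}
    parent = {s: s for s in centroids}  # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # An edge exists when the centroid distance does not exceed the threshold.
    for a, b in combinations(centroids, 2):
        if dist(centroids[a], centroids[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in centroids:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())


# Made-up coordinates, for illustration only (not taken from the data set).
posts = {
    "fever": [(0.0, 0.0), (2.0, 0.0)],
    "chills": [(1.0, 1.0), (1.0, 3.0)],
    "diarrhea": [(10.0, 10.0), (12.0, 10.0)],
    "heartburn": [(11.0, 11.0), (11.0, 13.0)],
}
groups = symptom_groups(posts, threshold=3.0)
print(groups)
```

Raising the threshold merges the groups into one, and lowering it splits every symptom into its own group, which is why the threshold has to be tuned by hand.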

 

 

 


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

Text preprocessing and analysis

Finding interesting keywords from tweets

 

Preprocessing:

Tweet -> Tokenize -> Normalize Slang -> POS Tagging -> select words whose part of speech is Noun, Adjective, or Verb -> get their root form (lemma)

 

Output: candidate keywords (unigrams and bigrams) along with their tweet ids (stop words are ignored; for bigrams, only those phrases in which both words are stop words are ignored)
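
A minimal Python sketch of the candidate-keyword step is given here for illustration. It covers only tokenization, slang normalization and the stop word rules; POS tagging and lemmatisation are left out. The slang map, the stop word list and the function names are hypothetical and far smaller than anything a real run would need.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny resources for illustration.
SLANG = {"u": "you", "gr8": "great", "2day": "today"}
STOP_WORDS = {"i", "a", "the", "of", "have", "you", "is", "to", "and"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def candidate_keywords(tweets):
    """Map each candidate unigram or bigram to the ids of tweets containing it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        tokens = [SLANG.get(t, t) for t in tokenize(text)]
        # Unigrams: skip stop words.
        for tok in tokens:
            if tok not in STOP_WORDS:
                index[tok].add(tweet_id)
        # Bigrams: skip only when both words are stop words.
        for first, second in zip(tokens, tokens[1:]):
            if not (first in STOP_WORDS and second in STOP_WORDS):
                index[f"{first} {second}"].add(tweet_id)
    return index


tweets = {1: "I have a sore throat", 2: "sore throat and fever 2day"}
index = candidate_keywords(tweets)
print(sorted(index))
```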

 

Keyword Selection

We ordered the candidate keywords by frequency. We selected the top 500 noun unigrams, the top 200 verb and adjective unigrams, and the top 200 bigrams. After some more processing and manual sifting, we retained 855 keywords as the most interesting.

 

Finding significant topics

 

We looked at various ways to visualize the keywords so as to identify the significant activities that had occurred in the area during the given time period. Eventually, we drew a day-by-day timeline for each keyword, using Protovis.

Each row represents the timeline for a keyword and has 21 cells (one per day). The shade of a cell (light to dark) is proportional to the number of tweets that mention that keyword on that day. In the figure below, a word like ‘movie’ appears frequently on most days and is thus not very informative, whereas ‘chill’ suddenly flares up at the end.

Figure 1: Visualizing significant topics

 

 

Using a combination of parameters such as minimum frequency and the presence of spikes, together with some manual review, we narrowed the list to 104 keywords of interest.

Next, we normalized each timeline (by dividing each cell by the highest value in that row) and flagged as unusual spikes those cells in which a single day accounts for at least 20% of the keyword’s total count.
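
The normalization and spike rule can be sketched in Python as follows. This is an illustrative reading of the description, not the code used for the entry: the daily counts are made up, and the function names `normalize_row` and `spike_days` are hypothetical.

```python
def normalize_row(counts):
    """Scale a keyword's daily counts by the row maximum (used for cell shading)."""
    peak = max(counts)
    return [c / peak if peak else 0.0 for c in counts]


def spike_days(counts, share=0.20):
    """Return the 1-based days holding at least `share` of the keyword's total."""
    total = sum(counts)
    if total == 0:
        return []
    return [day for day, c in enumerate(counts, start=1) if c / total >= share]


# Made-up 21-day timelines, for illustration only.
movie = [50] * 21                 # steady chatter, no single day stands out
chill = [1] * 18 + [40, 60, 80]   # quiet for 18 days, then a sudden flare-up

print(spike_days(movie))
print(spike_days(chill))
```

A keyword that is mentioned evenly across 21 days contributes under 5% per day and is never flagged, while a keyword concentrated in the last few days is flagged on each of them.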

 

This visual analysis helped us identify:

1. Interesting keywords

2. Days when significant events happened. The event list below was used by the map visualization after slight manual pruning.

 

Day | Interesting Keywords
1 |
2 | basketball
3 | accident
4 |
5 | convention
6 |
7 | car accident
8 | music festival
9 |
10 | baseball
11 |
12 | convention, hook
13 |
14 | airport, comment, crash, crew, destruction, emergency, ground, plane, sky
15 | building
16 |
17 | area, baseball, bomb, building, city, comic convention, cover, crew, drill, emergency, explosion, hook, police, stay, threat, toe
18 | accident, blast boom explosion, cloud, ground, smog town, traffic, truck, wind
19 | basketball, blow, shortness breath, chill, cold, technology convention, crap, declining health, fatigue, feeling, fever, headache, hook, medicine, rest, sick, sleeep, sleep, unwell
20 | bathroom toilet nausea diarrhea stomach tummy ab pain ache, +aching +muscles, appetite, blow, shortness breath breathing right, chills, christian, cool, cough, cramp, crap, crazy, +declining +health, +dry +cough, fatigue, +feeling +better, fever, flu, headache, heartburn, medicine, nap, new sickness, pain, +runny +nose, sick, +sore +throat, sweat, traffic, truck, unwell
21 | ab pain, ab, aching muscles, atrocious cough, bad diarrhea, blood, blow, brain, breath, breathing right, chest pain, chest, chill, christian, constant stream, cough, crazy, diarrhea, difficulty, doctor, dry cough, eating, fall, feel, feeling better, feeling, fever, flem, flu, fun day, grace, hate, home, hospital, hurt, max, medicine, mercy, mouth, nap, new sickness, nose, nurse, office, pain, pneumonia, problems breathing, rest, runny nose, shock, short breath, sick sucks, sickness, sleeeeeeeep, social life, sore throat, stand, stomach, stream, suck, superpower, temp, terible chest, throat, toilet, tummy, vomitting

 

Map visualization

We developed a tool in Processing to visualize tweets on the map. The map can be zoomed and scrolled. In the first mode, we enter keywords and drag a timeline slider to display the posts containing those keywords on the map. The time frame of the slider can be modified. This mode is useful for replaying a sequence of events. In the second mode, the tool displays the posts corresponding to the events or interesting topics detected on a given day. On selecting a day and an event, the posts for that event on that day are displayed on the map along with a histogram for the 24-hour period. This mode allows us to examine whether the events on a given day overlap geographically or in time. Weather icons are displayed at the bottom of the map.
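
The 24-hour histogram shown for a selected day and event amounts to bucketing the matching posts by hour. The sketch below is in Python rather than Processing and is only an illustration: the posts are made up, the plain substring match stands in for whatever keyword matching the tool performs, and `hourly_histogram` is a hypothetical name.

```python
from collections import Counter
from datetime import date, datetime


def hourly_histogram(posts, day, keyword):
    """Count posts mentioning `keyword` on `day`, bucketed by hour (0 to 23)."""
    counts = Counter(
        created.hour
        for created, text in posts
        if created.date() == day and keyword in text.lower()  # substring match
    )
    return [counts[h] for h in range(24)]


# Made-up posts, for illustration only.
posts = [
    (datetime(2011, 5, 18, 9, 15), "Fever and chills since this morning"),
    (datetime(2011, 5, 18, 9, 50), "this fever will not go away"),
    (datetime(2011, 5, 18, 21, 5), "Great basketball game tonight"),
    (datetime(2011, 5, 19, 9, 0), "fever again"),
]
histogram = hourly_histogram(posts, date(2011, 5, 18), "fever")
print(histogram)
```

Comparing the histograms of two keywords for the same day is one way to check whether they share a time signature.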

Hypothesis

17th May 2011 (Figure 2): Three events occur on this day: an explosion in Smogtown (pale yellow) and two road accidents on different bridges. One of the accidents involves a truck that is totaled (purple). Among the most-tweeted topics that day are ‘a cloud of dust’ (bright yellow) arising from the explosion and ‘wind’ (orange), especially in Downtown and Uptown.

Figure 2: The events that trigger the illnesses

18th May 2011 (Figure 3): Two events occur in Downtown: a technology convention (cyan) at the Convention Center followed by a basketball game (orange) in the Vastopolis Dome. These two events bring together large crowds, which sparks the epidemic outbreak. Among the symptoms tweeted about at these events are ‘fever’ (green), ‘shortness of breath’ (yellow) and ‘headache’ (blue), all exhibiting the same time signature. We conclude that the illness is transmitted from person to person, possibly through coughing. The agent causing the illness could have originated from the explosion and is likely to have been airborne, carried by the wind observed the day before. As people commute, the epidemic spreads across the city but remains concentrated in Downtown and Uptown.

Figure 3: The epidemic outbreak in Downtown and Uptown

19th May 2011 (Figure 4): The epidemic (only fever is shown, in yellow) continues to spread, showing newer symptoms such as ‘runny nose’, ‘sore throat’ and ‘aching muscles’. A new set of symptoms (diarrhea in orange and heartburn in cyan) is seen along the banks of the Vast River, originating from the bridge where the truck accident occurred on the 17th. Another truck accident (white), the only event occurring in that area on the 19th, seems to be unrelated to the new symptoms, since the symptoms appear much earlier in the day. The most plausible explanation is that the contents of the truck involved in the earlier accident spilled into the river and act as waterborne agents for a different illness. We are hence looking at two different illnesses exhibiting different sets of symptoms, as explained in the answer to MC 1.1.

Figure 4: Two distinct illnesses

20th May 2011 (Figure 5): The people tweeting about the epidemic outbreak (purple) are now clustered in the different hospitals in Vastopolis, while those affected by diarrhea are likely to visit hospitals soon. Although keywords such as ‘sleeeeeeeep’ and ‘feeling better’ may indicate rest and recovery, the patients still exhibit symptoms such as ‘atrocious cough’ (yellow) and ‘flem’, indicating that they can still transmit the infection. The hospitals will have to isolate patients based on the symptoms they exhibit in order to avoid further infections and contain the epidemic. If the waterborne infection can be treated without visiting a hospital, public announcements should be made to inform the citizens.

Figure 5: Containing the epidemic